COMPLETE ROADMAP: Building Text-to-Speech (TTS) & Speech-to-Text (STT) Models & Services
From Scratch to Production – Beginner → Advanced → Research Level
1. FOUNDATION PREREQUISITES
1.1 Mathematics
- Linear Algebra: Vectors, matrices, dot products, SVD, eigenvalues
- Why: All neural networks are matrix operations
- Calculus: Derivatives, gradients, chain rule, partial derivatives
- Why: Backpropagation relies on chain rule
- Probability & Statistics: Gaussian distributions, Bayesian inference, MLE, MAP
- Why: Acoustic models are probabilistic; language models use probability
- Signal Processing Mathematics
- Fourier Transform (DFT, FFT)
- Convolution theorem
- Z-Transform
- Nyquist-Shannon sampling theorem
- Windowing functions (Hamming, Hann, Blackman)
1.2 Programming Languages
- Python (primary – used in ~90% of ML/audio research)
- NumPy, SciPy, Matplotlib
- OOP, functional patterns, async programming
- C++ (for low-latency inference engines)
- JavaScript/TypeScript (for web APIs and browser-based STT/TTS)
- Shell/Bash (for pipeline automation, data processing)
1.3 Deep Learning Foundations
- Forward & backward propagation
- Activation functions (ReLU, GELU, Sigmoid, Softmax)
- Optimizers (SGD, Adam, AdamW, Lion)
- Regularization (Dropout, BatchNorm, LayerNorm, Weight Decay)
- Loss functions (Cross-entropy, CTC, MSE, L1)
- Sequence modeling fundamentals
1.4 Audio/Signal Processing Basics
- What is sound? Pressure waves, frequency, amplitude
- Sample rate (8kHz, 16kHz, 22.05kHz, 44.1kHz, 48kHz)
- Bit depth (8-bit, 16-bit, 32-bit float)
- Mono vs stereo
- Audio file formats: WAV, MP3, FLAC, OGG, OPUS
- Waveform representation
- Time domain vs frequency domain
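A minimal sketch tying these basics together, assuming soundfile and numpy are installed and a hypothetical speech.wav file exists: load a file, inspect its sample rate, bit depth, and channel count, then down-mix and normalize it.
import numpy as np
import soundfile as sf

# Load audio as float samples plus its sample rate
audio, sample_rate = sf.read("speech.wav")     # hypothetical input file

info = sf.info("speech.wav")
print(f"Sample rate: {sample_rate} Hz")        # e.g. 16000, 44100
print(f"Subtype (bit depth): {info.subtype}")  # e.g. PCM_16
print(f"Channels: {info.channels}")            # 1 = mono, 2 = stereo
print(f"Duration: {len(audio) / sample_rate:.2f} s")

# Down-mix stereo to mono by averaging channels
if audio.ndim == 2:
    audio = audio.mean(axis=1)

# Peak-normalize to [-1, 1]
audio = audio / np.max(np.abs(audio))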
2. CORE CONCEPTS & WORKING PRINCIPLES
2.1 How Human Speech Works
Lungs → Air pressure → Vocal cords vibrate → Resonates in vocal tract →
Articulators shape sound (tongue, lips, teeth) → Acoustic wave → Air → Ear
- Phonemes: Smallest units of sound (~44 in English)
- Prosody: Rhythm, stress, intonation, tempo
- Coarticulation: Phonemes influence neighboring sounds
- Formants: Resonant frequencies of the vocal tract (F1, F2, F3...)
2.2 Speech-to-Text (STT) β Working Principle
Audio Input → Pre-processing → Feature Extraction → Acoustic Model →
Language Model → Decoder → Text Output
Step-by-step:
- Microphone captures pressure variations → digital signal (waveform)
- Pre-process: remove noise, normalize, apply VAD (Voice Activity Detection)
- Extract features: convert raw audio to MFCCs, Mel Spectrograms, or raw waveform
- Acoustic model: predict phoneme/subword probabilities at each timestep
- Language model: rescore sequences based on linguistic probability
- Decoder: find most likely word sequence (Viterbi, Beam Search, CTC Greedy)
- Post-processing: punctuation restoration, capitalization, speaker labeling
2.3 Text-to-Speech (TTS) β Working Principle
Text Input → Text Analysis → Linguistic Features → Acoustic Model →
Vocoder → Audio Waveform Output
Step-by-step:
- Input text normalization (numbers → words, abbreviations → full form)
- G2P (Grapheme-to-Phoneme): convert letters to phonemes
- Prosody prediction: duration, pitch, energy per phoneme
- Acoustic model: generate mel spectrogram from linguistic features
- Vocoder: convert mel spectrogram to raw audio waveform
- Post-processing: audio normalization, format encoding
2.4 Key Audio Representations
| Representation | Description | Used In |
|---|---|---|
| Raw Waveform | Time-domain amplitude samples | WaveNet, WaveGlow, Encodec |
| STFT Spectrogram | Frequency vs time (complex) | Analysis, source separation |
| Mel Spectrogram | Perceptually-scaled frequency | Tacotron, Whisper, FastSpeech |
| MFCC | Compressed mel cepstral coefficients | Traditional ASR, GMM-HMM |
| Log-Mel | Log of mel spectrogram | Whisper, wav2vec 2.0 |
| Codec Tokens | Discrete audio tokens | EnCodec, SoundStream, VALL-E |
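A small sketch, assuming librosa is installed and a hypothetical speech.wav exists, that computes the continuous representations from the table above so their shapes can be compared:
import numpy as np
import librosa

# Raw waveform (time domain), resampled to 16 kHz mono
y, sr = librosa.load("speech.wav", sr=16000)   # hypothetical input file
print("waveform:", y.shape)

# STFT spectrogram (complex, frequency x time)
stft = librosa.stft(y, n_fft=512, hop_length=160)
print("STFT:", stft.shape)                     # (257, T)

# Mel spectrogram (perceptually scaled) and its log
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160, n_mels=80)
log_mel = np.log(mel + 1e-9)
print("log-mel:", log_mel.shape)               # (80, T)

# MFCCs (compressed cepstral coefficients)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print("MFCC:", mfcc.shape)                     # (13, T)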
3. STRUCTURED LEARNING PATH
PHASE 0: Signal Processing Foundations (4–6 weeks)
- Topic 1: Digital Audio Fundamentals
- Sampling and quantization
- Aliasing and anti-aliasing filters
- PCM encoding
- Practice: Load WAV files, plot waveforms with librosa/scipy
- Topic 2: Fourier Analysis
- Discrete Fourier Transform (DFT)
- Fast Fourier Transform (FFT) – Cooley-Tukey algorithm
- Short-Time Fourier Transform (STFT)
- Window size (frame length), hop size, overlap
- Griffin-Lim reconstruction algorithm
- Practice: Compute STFT, plot spectrograms, reconstruct audio
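One way to do the Topic 2 practice exercise, as a sketch (librosa and soundfile assumed installed, filename hypothetical): compute an STFT, throw away the phase, and recover a waveform with Griffin-Lim.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=22050)   # hypothetical input file

# STFT with a Hann window: frame length 1024 (~46 ms), hop 256 (75% overlap)
stft = librosa.stft(y, n_fft=1024, hop_length=256, window="hann")
magnitude = np.abs(stft)                        # discard phase

# Griffin-Lim: iteratively estimate a phase consistent with this magnitude
y_rec = librosa.griffinlim(magnitude, n_iter=60, hop_length=256, n_fft=1024)

sf.write("reconstructed.wav", y_rec, sr)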
- Topic 3: Mel Scale & Perceptual Features
- Mel filter banks (triangular filters on Mel scale)
- MFCC computation pipeline:
- Pre-emphasis filter
- Framing + windowing
- FFT
- Mel filter bank
- Log compression
- DCT (Discrete Cosine Transform)
- Delta and delta-delta features
- Practice: Implement MFCC from scratch without librosa
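A condensed sketch of that practice item: MFCC computed from scratch with NumPy/SciPy only. Frame, hop, FFT, and filter counts below are common but illustrative defaults.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_filters, n_fft, sr, f_min=0, f_max=None):
    # Triangular filters evenly spaced on the mel scale
    f_max = f_max or sr / 2
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_mfcc=13):
    # 1. Pre-emphasis
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing + Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 3. Power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Mel filterbank + log compression
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # 5. DCT, keep the first n_mfcc coefficients
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]
Comparing the output against librosa.feature.mfcc on the same signal is a good sanity check (values will differ slightly due to different defaults).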
- Topic 4: Audio Pre-processing
- Noise reduction (spectral subtraction, Wiener filter)
- Voice Activity Detection (VAD) – energy-based, WebRTC VAD, Silero VAD (energy-based sketch below)
- Audio normalization (peak, RMS, LUFS)
- Resampling (polyphase filters)
- Practice: Build an audio pre-processing pipeline
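A rough sketch of the simplest VAD mentioned above, frame-wise RMS energy against a threshold; the threshold and frame size are illustrative assumptions, and production systems normally use WebRTC or Silero VAD instead.
import numpy as np

def energy_vad(audio, sample_rate=16000, frame_ms=30, threshold_db=-35.0):
    """Return one boolean per frame: True if the frame is likely speech."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    flags = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2) + 1e-12)
        rms_db = 20 * np.log10(rms + 1e-12)
        flags.append(rms_db > threshold_db)
    return np.array(flags)

# Illustrative usage: keep only frames flagged as speech
# speech_flags = energy_vad(audio)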
PHASE 1: Classical Speech Processing (3–4 weeks)
- Topic 5: Hidden Markov Models (HMM)
- Markov chains and state transitions
- HMM components: states, observations, transition matrix, emission matrix
- Three HMM problems:
- Evaluation → Forward algorithm
- Decoding → Viterbi algorithm
- Learning → Baum-Welch (EM algorithm)
- HMM for phoneme modeling
- Practice: Implement HMM for digit recognition
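A compact NumPy sketch of the Viterbi algorithm from the list above, decoding the most likely state path from log-probability matrices; the toy two-state model at the bottom is an illustrative assumption.
import numpy as np

def viterbi(log_pi, log_A, log_B, observations):
    """log_pi: (S,) initial log-probs; log_A: (S, S) transition log-probs;
    log_B: (S, O) emission log-probs; observations: list of observation ids."""
    S, T = len(log_pi), len(observations)
    delta = np.full((T, S), -np.inf)    # best path log-prob ending in state s at time t
    psi = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = log_pi + log_B[:, observations[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_A[:, s]
            psi[t, s] = np.argmax(scores)
            delta[t, s] = scores.max() + log_B[s, observations[t]]
    # Backtrack the best state sequence
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 2-observation example (illustrative numbers)
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(log_pi, log_A, log_B, [0, 0, 1, 1]))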
- Topic 6: Gaussian Mixture Models (GMM)
- Mixture of Gaussians
- EM algorithm for GMM training
- GMM-HMM acoustic models
- Speaker adaptation: MLLR, MAP adaptation
- Topic 7: N-gram Language Models
- Unigram, bigram, trigram
- Perplexity metric
- Smoothing: Laplace, Kneser-Ney, Good-Turing
- ARPA format language model files
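A toy sketch of the Topic 7 ideas: a bigram model with Laplace (add-one) smoothing and its perplexity on a sentence. The two-sentence corpus is an illustrative assumption; real systems use KenLM-style Kneser-Ney models in ARPA format.
import math
from collections import Counter

corpus = ["<s> the cat sat </s>", "<s> the dog sat </s>"]   # illustrative corpus
tokens = [s.split() for s in corpus]
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))
context_counts = Counter(sent[i] for sent in tokens for i in range(len(sent) - 1))
vocab_size = len({w for sent in tokens for w in sent})

def bigram_prob(w1, w2):
    # Laplace (add-one) smoothing over the vocabulary
    return (bigrams[(w1, w2)] + 1) / (context_counts[w1] + vocab_size)

def perplexity(sentence):
    words = sentence.split()
    log_prob = sum(math.log(bigram_prob(words[i], words[i + 1]))
                   for i in range(len(words) - 1))
    return math.exp(-log_prob / (len(words) - 1))

print(perplexity("<s> the cat sat </s>"))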
- Topic 8: Classical Vocoders (TTS)
- Formant synthesis (rule-based)
- Concatenative TTS: unit selection
- STRAIGHT vocoder
- WORLD vocoder (F0 + spectral envelope + aperiodicity)
- Practice: Use WORLD vocoder to analyze and resynthesize speech
PHASE 2: Deep Learning for Speech (6–8 weeks)
- Topic 9: Recurrent Neural Networks
- Vanilla RNN and vanishing gradient problem
- LSTM (Long Short-Term Memory):
- Input gate, forget gate, output gate, cell state
- GRU (Gated Recurrent Unit)
- Bidirectional RNNs
- Practice: Build sequence-to-sequence model for toy TTS
- Topic 10: Convolutional Neural Networks for Audio
- 1D convolution for raw waveform
- 2D convolution for spectrograms
- Dilated causal convolutions (key for WaveNet)
- Depthwise separable convolutions
- Practice: Build CNN-based phoneme classifier
- Topic 11: Attention Mechanisms
- Dot-product attention
- Scaled dot-product attention
- Multi-head attention
- Self-attention vs cross-attention
- Location-sensitive attention (Tacotron)
- Practice: Implement attention from scratch
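A minimal PyTorch sketch of the Practice item above: scaled dot-product attention with an optional mask, single head, batch-first tensors.
import math
import torch

def scaled_dot_product_attention(query, key, value, mask=None):
    """query/key/value: (batch, seq_len, d_model); mask broadcastable to (batch, q_len, k_len)."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)    # (batch, q_len, k_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # block disallowed positions
    weights = torch.softmax(scores, dim=-1)                    # attention distribution
    return weights @ value, weights

# Toy causal self-attention over a random sequence
x = torch.randn(2, 5, 16)
causal = torch.tril(torch.ones(5, 5))        # lower-triangular mask for autoregressive decoding
out, attn = scaled_dot_product_attention(x, x, x, mask=causal)
print(out.shape, attn.shape)                 # (2, 5, 16) and (2, 5, 5)
Multi-head attention repeats this per head on projected slices of d_model and concatenates the results.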
- Topic 12: Transformer Architecture
- Encoder-Decoder structure
- Positional encoding (sinusoidal, learned, RoPE, ALiBi)
- Feed-forward networks
- Layer normalization (Pre-LN vs Post-LN)
- Masked attention for autoregressive decoding
- Practice: Train a small Transformer on character sequences
- Topic 13: Connectionist Temporal Classification (CTC)
- The alignment problem in speech recognition
- CTC forward algorithm
- CTC loss and gradient
- CTC greedy and beam search decoding
- CTC + language model rescoring
- Practice: Train CTC model on TIMIT dataset
PHASE 3: Modern STT Systems (8–10 weeks)
- Topic 14: End-to-End ASR Architectures
- Listen, Attend and Spell (LAS)
- Deep Speech 1 & 2 (Baidu)
- Jasper, QuartzNet (NVIDIA)
- Conformer (combining CNN + Transformer)
- Architecture comparison: CTC vs Attention vs RNN-T
- Topic 15: Self-Supervised Learning for Speech
- Contrastive Predictive Coding (CPC)
- wav2vec / wav2vec 2.0 (Facebook/Meta)
- CNN feature encoder + Transformer context network
- Quantization module (product quantization)
- Contrastive loss with negative sampling
- HuBERT (Hidden Unit BERT)
- Offline clustering → pseudo-label generation
- BERT-style masked prediction
- WavLM: wav2vec 2.0 + denoising objective
- Practice: Fine-tune wav2vec 2.0 on custom dataset
- Topic 16: Whisper (OpenAI)
- Architecture: Encoder-Decoder Transformer
- Training data: 680,000 hours weakly supervised
- Input: 30-second log-Mel spectrogram (80 channels)
- Multitask training: transcription + translation + language ID + VAD
- Tokenizer: BPE with multilingual vocabulary
- Model sizes: tiny(39M), base(74M), small(244M), medium(769M), large(1.5B)
- Practice: Deploy Whisper, fine-tune on domain-specific data
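A quick sketch of running the reference openai-whisper package on a hypothetical audio file before moving on to fine-tuning, using the library's documented high-level API:
import whisper

model = whisper.load_model("small")                  # tiny/base/small/medium/large
result = model.transcribe("meeting.wav", language="en", task="transcribe")

print(result["text"])                                # full transcript
for segment in result["segments"]:                   # per-segment timestamps
    print(f"[{segment['start']:.2f}-{segment['end']:.2f}] {segment['text']}")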
- Topic 17: RNN-T (Recurrent Neural Network Transducer)
- Encoder (audio) + Prediction network (text) + Joint network
- Transducer loss function
- On-device streaming ASR
- Used by: Google, Apple, Amazon Alexa
- Practice: Train small RNN-T on LibriSpeech subset
- Topic 18: Streaming & Real-Time ASR
- Chunk-based processing
- Latency vs accuracy tradeoff
- Lookahead context
- Cache-aware streaming Conformer
- CTC prefix beam search for streaming
- Practice: Build real-time transcription with WebRTC + Whisper
PHASE 4: Modern TTS Systems (8–10 weeks)
- Topic 19: Neural TTS Pipeline
- Text normalization (written → spoken form)
- Number normalization
- Abbreviation expansion
- Date/time normalization
- G2P (Grapheme-to-Phoneme):
- Rule-based (CMU Pronouncing Dictionary)
- Sequence-to-sequence G2P
- Transformer G2P
- Phoneme inventory and IPA
- Prosody: F0 (pitch), duration, energy
- Topic 20: Tacotron & Tacotron 2
- Tacotron 1: CBHG + attention + Griffin-Lim
- Tacotron 2:
- Encoder: Conv layers + BiLSTM
- Attention: Location-sensitive
- Decoder: Autoregressive LSTM → mel spectrogram
- Stop token prediction
- WaveNet vocoder
- Practice: Train Tacotron 2 on LJ Speech dataset
- Topic 21: FastSpeech & FastSpeech 2
- FastSpeech 1: Knowledge distillation from autoregressive teacher
- Feed-forward Transformer (FFT)
- Length regulator (phoneme duration; sketch below)
- Parallel mel generation (non-autoregressive)
- FastSpeech 2: No teacher-forcing
- Duration predictor
- Pitch predictor (F0)
- Energy predictor
- Variance adaptor
- Speed: ~270x faster mel-spectrogram generation than the autoregressive baseline (per the FastSpeech paper)
- Practice: Train FastSpeech 2 on LJ Speech
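The length regulator referenced above is essentially a repeat-by-duration operation. A minimal PyTorch sketch (the durations are made up for illustration):
import torch

def length_regulator(phoneme_hidden, durations):
    """phoneme_hidden: (num_phonemes, d_model); durations: (num_phonemes,) integer frame counts.
    Expands each phoneme's hidden vector to its predicted number of mel frames."""
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

hidden = torch.randn(4, 256)                # 4 phonemes, 256-dim encoder outputs
durations = torch.tensor([3, 5, 2, 7])      # illustrative per-phoneme frame counts
frames = length_regulator(hidden, durations)
print(frames.shape)                         # torch.Size([17, 256]) -> feeds the mel decoder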
- Topic 22: VITS (Variational Inference TTS)
- End-to-end: text → waveform in one model
- Components: posterior encoder, prior encoder, decoder (HiFi-GAN)
- Variational autoencoder (VAE) latent space
- Normalizing flows (affine coupling layers)
- GAN training for waveform quality
- Stochastic duration predictor
- Practice: Train VITS, experiment with fine-tuning on custom voice
- Topic 23: Neural Vocoders
- WaveNet: Autoregressive dilated causal CNN, slow but high quality
- WaveGlow: Normalizing flow, parallel generation
- MelGAN: GAN-based, fast, lightweight
- HiFi-GAN: Multi-period discriminator + multi-scale discriminator, best quality/speed
- BigVGAN: Large-scale HiFi-GAN with anti-aliased activations
- EnCodec: Neural audio codec (RVQ-based), used as tokenizer
- Practice: Train HiFi-GAN on LJ Speech
- Topic 24: Voice Cloning
- Speaker embeddings: d-vector, x-vector, ECAPA-TDNN (extraction sketch below)
- Speaker verification vs identification
- Zero-shot voice cloning: YourTTS, XTTS, OpenVoice
- Few-shot voice cloning: 3–10 seconds of reference audio
- Speaker encoder: GE2E loss (generalized end-to-end loss)
- Practice: Implement zero-shot voice cloning with XTTS
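Speaker embeddings like those listed above can be extracted with a pretrained ECAPA-TDNN. A sketch using SpeechBrain's published VoxCeleb verification model (the reference file path is hypothetical):
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Pretrained ECAPA-TDNN speaker encoder trained on VoxCeleb
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

signal, sr = torchaudio.load("reference_voice.wav")   # hypothetical 16 kHz reference clip
embedding = encoder.encode_batch(signal)              # (1, 1, 192) speaker embedding
print(embedding.shape)

# Cosine similarity between two such embeddings is the usual verification score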
PHASE 5: Large-Scale Models & Advanced Techniques (8–12 weeks)
- Topic 25: Language Models for TTS & STT
- VALL-E: TTS as a language modeling task
- EnCodec tokens (8 RVQ levels)
- AR model for coarse tokens + NAR for fine tokens
- In-context learning for voice cloning
- AudioLM: Audio continuation using hierarchical tokens
- SoundStorm: Non-autoregressive audio generation
- Voicebox: Flow-matching-based TTS
- Topic 26: Diffusion Models for Speech
- Score-based generative models
- Denoising Diffusion Probabilistic Models (DDPM)
- DiffWave: diffusion-based vocoder
- Grad-TTS: diffusion-based acoustic model
- Stable Diffusion concepts applied to audio
- DDIM sampling for fast inference
- Topic 27: Flow Matching
- Continuous normalizing flows
- Flow matching vs diffusion: faster training, ODE-based
- Voicebox (Meta): flow matching for TTS
- Matcha-TTS: ODE-based TTS
- E2-TTS / F5-TTS: flow matching with flat text input
- Topic 28: Multilingual & Code-Switching
- Multilingual acoustic models
- Language identification integration
- Code-switching (mixing languages mid-sentence)
- MMS (Meta Massively Multilingual Speech): 1000+ languages
- Cross-lingual transfer learning
- Low-resource language adaptation
- Topic 29: Emotion & Style Control
- Emotion embeddings (happy, sad, angry, neutral...)
- Global Style Tokens (GST)
- Reference audio-based style transfer
- Prosody transfer
- Voice conversion (change voice, keep content)
- Practice: Build emotion-controlled TTS using GST-Tacotron
PHASE 6: Production & MLOps (4–6 weeks)
- Topic 30: Model Optimization
- Quantization: INT8, INT4, dynamic quantization (sketch after this list)
- Pruning: structured, unstructured, magnitude-based
- Knowledge distillation for smaller models
- ONNX export and ONNX Runtime
- TensorRT optimization (NVIDIA)
- OpenVINO (Intel)
- Edge deployment: TFLite, CoreML, NCNN
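A sketch of two of these steps on a generic PyTorch model: post-training dynamic INT8 quantization and ONNX export. The stand-in model, input shapes, and file names are illustrative assumptions; real TTS/STT models usually need custom export code.
import torch
import torch.nn as nn

# Stand-in model; substitute your trained acoustic model
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 29)).eval()

# Dynamic INT8 quantization of the Linear layers (CPU inference)
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# ONNX export of the fp32 model with dynamic batch/time axes
dummy = torch.randn(1, 100, 80)   # (batch, frames, features), illustrative
torch.onnx.export(
    model, dummy, "acoustic_model.onnx",
    input_names=["features"], output_names=["logits"],
    dynamic_axes={"features": {0: "batch", 1: "time"}, "logits": {0: "batch", 1: "time"}},
    opset_version=17,
)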
- Topic 31: Inference Optimization
- Batching strategies (dynamic batching)
- Caching (KV cache, encoder cache)
- Speculative decoding
- CTranslate2 for faster Transformer inference
- Triton Inference Server
- TorchScript and torch.compile
- Topic 32: Service Architecture
- REST API design (FastAPI, Flask)
- WebSocket for real-time streaming
- gRPC for high-performance RPC
- Message queues (RabbitMQ, Kafka) for async processing
- Load balancing and horizontal scaling
- Rate limiting and API key management
- CDN for audio delivery
- Topic 33: MLOps Pipeline
- Experiment tracking: MLflow, Weights & Biases
- Data versioning: DVC
- Model registry and versioning
- CI/CD for ML models
- Monitoring: model drift, latency, error rate
- A/B testing for TTS quality
- Data flywheel and continuous improvement
4. ALGORITHMS, TECHNIQUES & TOOLS
4.1 Core Algorithms
STT Algorithms
| Algorithm | Type | Key Use |
|---|---|---|
| Viterbi | Dynamic Programming | HMM decoding, best path |
| Baum-Welch | EM | HMM training |
| CTC Forward-Backward | DP | CTC loss computation |
| Beam Search | Tree Search | Sequence decoding |
| Prefix Beam Search | Tree Search | CTC with LM integration |
| WFST (Weighted FST) | Graph | Kaldi-style decoding |
| BPE (Byte Pair Encoding) | Tokenization | Subword vocabulary |
| Word2Vec/FastText | Embedding | Text representation |
| Forced Alignment | DP | Aligning audio to transcripts |
TTS Algorithms
| Algorithm | Type | Key Use |
|---|---|---|
| Griffin-Lim | Phase reconstruction | Spectrogram → waveform |
| WORLD vocoder | Signal processing | Parametric voice synthesis |
| VAE | Generative | Latent space for style |
| Normalizing Flows | Generative | Invertible transformations |
| GAN | Generative | Waveform generation, vocoders |
| DDPM | Generative | Diffusion vocoders |
| Flow Matching | Generative | Fast TTS (F5-TTS, Voicebox) |
| RVQ (Residual Vector Quantization) | Compression | Audio tokenization |
4.2 Neural Network Architectures
- CNN: WaveNet, DeepSpeech, Jasper, QuartzNet
- LSTM/GRU: Tacotron, early E2E ASR
- Transformer: Whisper, FastSpeech, wav2vec 2.0
- Conformer: SOTA for ASR (CNN + Self-attention hybrid)
- Diffusion U-Net: DiffWave, Grad-TTS
- Flow network: WaveGlow, Glow-TTS, VITS
- Codec model: EnCodec, SoundStream, DAC
4.3 Training Techniques
- Teacher Forcing: train decoder with ground truth
- Scheduled Sampling: gradually mix teacher/model predictions
- Knowledge Distillation: teacher-student training
- Contrastive Learning: wav2vec, SimCLR-style
- Multi-task Learning: Whisper (transcription + translation + LID)
- Transfer Learning: fine-tune pretrained models
- Data Augmentation:
- SpecAugment (time/frequency masking; sketch below)
- Speed perturbation (0.9x, 1.0x, 1.1x)
- Room Impulse Response (RIR) convolution
- Additive noise (MUSAN, AudioSet)
- Pitch shifting, time stretching
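A short sketch of SpecAugment-style masking using torchaudio's built-in transforms; mask widths are illustrative, and the original paper's time warping step is omitted.
import torch
import torchaudio.transforms as T

spec_augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=15),   # mask up to 15 consecutive mel bins
    T.TimeMasking(time_mask_param=35),        # mask up to 35 consecutive frames
)

log_mel = torch.randn(1, 80, 400)             # (batch, mels, frames), stand-in features
augmented = spec_augment(log_mel)
print(augmented.shape)                        # same shape, with masked regions zeroed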
4.4 Python Libraries & Frameworks
Audio Processing
librosa – Audio analysis, feature extraction, visualization
soundfile – Read/write audio files (WAV, FLAC, OGG)
pydub – Audio manipulation (cut, join, convert)
scipy.signal – Signal processing primitives
torchaudio – PyTorch audio I/O and transforms
audioread – Backend-agnostic audio reading
pyworld – Python wrapper for WORLD vocoder
resampy – High-quality audio resampling
webrtcvad – Google's WebRTC VAD
silero-vad – Neural VAD (accurate, fast)
Deep Learning
PyTorch – Primary framework for research/production
TensorFlow – Production, mobile (TFLite)
JAX/Flax – Google research framework
HuggingFace Transformers – Pre-trained models hub
HuggingFace Datasets – Dataset loading/processing
STT-Specific
openai-whisper – OpenAI Whisper (all sizes)
faster-whisper – CTranslate2-optimized Whisper (4x faster)
whisperx – Whisper + word-level alignment
nemo (NVIDIA NeMo) – ASR, TTS, NLP toolkit
espnet – End-to-end speech processing
kaldi – Classical + hybrid ASR
speechbrain – PyTorch speech toolkit
wav2letter++ – Meta's ASR toolkit
deepgram – Commercial STT API (also research)
vosk – Offline STT (lightweight)
TTS-Specific
TTS (Coqui) – Open-source TTS: Tacotron, VITS, XTTS
espeak-ng – Lightweight rule-based TTS (G2P)
pyttsx3 – Offline TTS wrapper
bark – Suno's generative TTS (GPT-style)
tortoise-tts – Slow but high-quality multi-voice TTS
XTTS / Coqui XTTS – Multilingual voice cloning (VITS-based)
StyleTTS2 – Style-based TTS (SOTA on LJ Speech)
parler-tts – Description-controlled TTS
kokoro-tts – Lightweight high-quality TTS
Serving & Infrastructure
FastAPI – Async Python web framework
uvicorn – ASGI server
triton – NVIDIA model serving
onnxruntime – Cross-platform model inference
ctranslate2 – Efficient Transformer inference
ray serve – Distributed model serving
celery – Async task queue
redis – Caching, pub/sub, queue
5. ARCHITECTURE DEEP DIVE
5.1 STT Architecture Family Tree
Classical Era
├── GMM-HMM (1990s–2010s)
│   ├── Feature: MFCC
│   ├── Acoustic: GMM per HMM state
│   └── Decoder: Viterbi + N-gram LM
│
└── DNN-HMM (2012–2016)
    ├── Feature: MFCC / fbank
    ├── Acoustic: DNN replaces GMM
    └── Decoder: Viterbi + WFST

End-to-End Era
├── CTC-Based (2014–2019)
│   ├── DeepSpeech 1 & 2: RNN + CTC
│   ├── Jasper: CNN + CTC
│   └── QuartzNet: Depthwise sep CNN + CTC
│
├── Attention-Based (2016–2020)
│   ├── LAS: LSTM encoder + attention decoder
│   └── Transformer ASR: Self-attention encoder + decoder
│
└── Hybrid CTC-Attention (2017–present)
    └── ESPnet models, Conformer

Self-Supervised Era
├── wav2vec 2.0 (2020): CNN + Transformer + contrastive
├── HuBERT (2021): CNN + Transformer + BERT-style
├── WavLM (2022): HuBERT + denoising
└── Whisper (2022): Supervised multitask, Enc-Dec Transformer

Streaming / On-device
├── RNN-T: encoder + predictor + joiner
├── Streaming Conformer: chunk-based
└── Distil-Whisper: 6x faster distilled version
5.2 Conformer Architecture (SOTA for ASR)
Input Audio → Log-Mel Spectrogram (80 dims) → Conv Subsampling (4x)
→ Linear Projection → [Conformer Block × N] → CTC / Attention Head
Conformer Block:
Input
↓
Feed-Forward Module (½ scaling)
↓
Multi-Head Self-Attention Module
↓
Convolution Module (depthwise)
↓
Feed-Forward Module (½ scaling)
↓
LayerNorm
↓
Output
Convolution Module:
LayerNorm → Pointwise Conv → GLU → Depthwise Conv →
BatchNorm → Swish activation → Pointwise Conv → Dropout
5.3 Whisper Architecture Detail
Encoder:
Log-Mel Spectrogram (80 × 3000 frames for 30s)
→ 2× Conv1D (stride 1, 2) + GELU
→ Sinusoidal Positional Encoding
→ Transformer Encoder Blocks (4–32 layers depending on model size)
Each block: Self-Attention + FFN + LayerNorm (pre-norm)
Decoder:
Special tokens: <|startoftranscript|> <|language|> <|task|> <|notimestamps|>
→ Token Embedding + Learned Positional Encoding
→ Transformer Decoder Blocks (4–32 layers)
Each block: Masked Self-Attention + Cross-Attention + FFN
→ Linear → Softmax over vocab (51865 tokens)
5.4 TTS Architecture Family Tree
Classical Era
├── Formant synthesis (rule-based, 1960s–)
├── Concatenative TTS (unit selection, 1990s–)
│   └── Record many hours → select and concatenate units
└── HMM-based TTS (HTS, 2000s–)
    └── STRAIGHT/WORLD vocoder
Neural Era
├── Seq2Seq + Attention
│   ├── Tacotron 1 (2017): CBHG + Griffin-Lim
│   └── Tacotron 2 (2017): BiLSTM + WaveNet vocoder
│
├── Parallel / Non-autoregressive
│   ├── FastSpeech 1 (2019): FFT + duration from teacher
│   ├── FastSpeech 2 (2020): Duration/pitch/energy predictors
│   ├── SpeedySpeech (2020)
│   └── JETS (2022): E2E with alignment learning
│
├── Normalizing Flow Based
│   ├── Glow-TTS (2020): Flow-based alignment + generation
│   └── VITS (2021): E2E VAE + flows + HiFi-GAN
│
└── Diffusion Based
    ├── DiffTTS (2021)
    ├── Grad-TTS (2021): Score-based diffusion
    └── NaturalSpeech (2022): VITS + diffusion

LLM/Codec Era (2023–present)
├── VALL-E (2023): AR + NAR codec language model
├── SPEAR-TTS (2023): Self-supervised TTS
├── Voicebox (2023): Flow matching
├── NaturalSpeech 3 (2024): FACodec + diffusion
├── F5-TTS (2024): Flow matching + flat text
└── CosyVoice (2024): LLM + flow matching
5.5 VITS Architecture Detail (Recommended Starting Point)
TEXT INPUT
↓
[Text Encoder]
Phoneme embedding → Transformer encoder → Prior distribution μ, σ
[Stochastic Duration Predictor]
Flow-based duration prediction
[Length Regulator]
Expand phoneme representations to frame length
[Decoder / Flow-based Posterior]
VAE encoder: mel → latent z
Normalizing flows: transforms z
[HiFi-GAN Generator] (Vocoder)
z → raw waveform
[Discriminators] (training only)
Multi-Period Discriminator (MPD)
Multi-Scale Discriminator (MSD)
LOSS = Mel loss + KL divergence + Duration loss + GAN loss + Feature matching loss
5.6 HiFi-GAN Architecture Detail
Generator:
Input: Mel Spectrogram (80 × T)
→ Transposed Conv (×4 upsample) → MRF Block → repeat until audio rate
MRF Block = Multi-Receptive Field Fusion
= ResBlock(k=3) + ResBlock(k=7) + ResBlock(k=11)
Each ResBlock: dilated conv with rates [1,3,5]
Output: Raw waveform at 22050Hz
Multi-Period Discriminator (MPD):
Periods p = [2, 3, 5, 7, 11]
Reshape waveform into (T/p, p) → Conv2D per period
Multi-Scale Discriminator (MSD):
Operate at 3 scales: raw, ×2 avg pooled, ×4 avg pooled
6. DESIGN & DEVELOPMENT PROCESS
6.1 STT Development from Scratch
Step 1: Data Collection & Preparation
Sources:
- LibriSpeech: 960h clean English (openslr.org)
- CommonVoice: Mozilla multilingual crowdsourced
- VoxPopuli: EU parliament recordings
- FLEURS: Google multilingual
- Custom: Record, transcribe, verify
Pipeline:
raw_audio → segment_by_vad → normalize_loudness →
resample_to_16kHz → verify_transcript → create_manifest_json
Manifest format:
{"audio_filepath": "path/to/audio.wav", "duration": 3.2, "text": "hello world"}
Step 2: Feature Extraction
import torch
import torchaudio
import torchaudio.transforms as T
def extract_mel_spectrogram(waveform, sample_rate=16000):
mel_transform = T.MelSpectrogram(
sample_rate=sample_rate,
n_fft=400, # ~25ms window at 16kHz
hop_length=160, # ~10ms hop
n_mels=80,
f_min=80,
f_max=7600
)
log_mel = torch.log(mel_transform(waveform) + 1e-9)
return log_mel # Shape: (80, T)
Step 3: Model Architecture (Conformer CTC)
import torch.nn as nn

# Conv2dSubsampling and ConformerBlock are assumed to be implemented separately (see Topics 12 and 14)
class ConformerASR(nn.Module):
def __init__(self, input_dim=80, vocab_size=29, d_model=256, num_heads=4, num_layers=6):
super().__init__()
self.conv_subsample = Conv2dSubsampling(input_dim, d_model)
self.encoder = nn.ModuleList([
ConformerBlock(d_model, num_heads) for _ in range(num_layers)
])
self.ctc_head = nn.Linear(d_model, vocab_size)
def forward(self, x, x_lengths):
x, x_lengths = self.conv_subsample(x, x_lengths)
for block in self.encoder:
x = block(x)
logits = self.ctc_head(x)
return logits, x_lengths
Step 4: Training Loop
import torch
import torch.nn.functional as F
from torch.nn import CTCLoss
criterion = CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
for batch in dataloader:
audio, audio_len, tokens, token_len = batch
logits, out_len = model(audio, audio_len)
log_probs = F.log_softmax(logits.transpose(0,1), dim=-1)
loss = criterion(log_probs, tokens, out_len, token_len)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
optimizer.zero_grad()
Step 5: Decoding
# Greedy CTC decoding
def greedy_decode(logits, blank_id=0):
predicted = torch.argmax(logits, dim=-1)
decoded = []
prev = blank_id
for p in predicted:
if p != blank_id and p != prev:
decoded.append(p.item())
prev = p
return decoded
# Beam search with language model (use pyctcdecode)
from pyctcdecode import build_ctcdecoder
decoder = build_ctcdecoder(vocab, kenlm_model_path="lm.arpa", alpha=0.5, beta=1.0)
text = decoder.decode(logits.numpy())
Step 6: Evaluation
from jiwer import wer, cer
# Word Error Rate
error_rate = wer(reference_texts, hypothesis_texts)
char_error = cer(reference_texts, hypothesis_texts)
print(f"WER: {error_rate:.2%}, CER: {char_error:.2%}")
6.2 TTS Development from Scratch
Step 1: Data Collection & Preparation
Datasets:
- LJ Speech: 24h single speaker English (ljspeech.github.io)
- VCTK: 109 English speakers
- LibriTTS: 585h multi-speaker
- HiFi-TTS: High quality multi-speaker
- Custom: Studio-quality recording (quiet room, good mic)
Recording specs for custom:
- 44.1kHz or 48kHz, 24-bit, mono
- Acoustic treatment (no echo/reverb)
- Consistent mic distance (15–20cm)
- Phonetically balanced scripts
- 1–10 hours for fine-tuning; 20+ for training from scratch
Preprocessing:
audio → normalize_to_-20dBFS → trim_silence → resample_22050Hz →
extract_mel → create_filelists (train|val|test)
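A sketch of that preprocessing chain for one file, using librosa and soundfile; the target level, trim threshold, and file paths are illustrative assumptions.
import numpy as np
import librosa
import soundfile as sf

def preprocess(in_path, out_path, target_sr=22050, target_dbfs=-20.0, top_db=40):
    y, sr = librosa.load(in_path, sr=None, mono=True)
    # Trim leading/trailing silence
    y, _ = librosa.effects.trim(y, top_db=top_db)
    # Resample to the TTS training rate
    y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
    # RMS-normalize towards the target dBFS level
    rms = np.sqrt(np.mean(y ** 2) + 1e-12)
    gain = 10 ** (target_dbfs / 20) / rms
    y = np.clip(y * gain, -1.0, 1.0)
    sf.write(out_path, y, target_sr)

preprocess("raw/clip_0001.wav", "processed/clip_0001.wav")   # illustrative paths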
Step 2: Text Frontend
import phonemizer
from phonemizer.backend import EspeakBackend
backend = EspeakBackend('en-us', preserve_punctuation=True, with_stress=True)
def text_to_phonemes(text):
    # Normalize text first (normalize_numbers / expand_abbreviations are assumed helper functions)
    text = normalize_numbers(text)       # "123" -> "one hundred twenty three"
    text = expand_abbreviations(text)    # "Dr." -> "Doctor"
# Convert to phonemes
phonemes = backend.phonemize([text])[0]
return phonemes
# Phoneme to ID mapping
phoneme_to_id = {p: i for i, p in enumerate(PHONEME_SET)}
Step 3: FastSpeech 2 Model
import torch.nn as nn

# FFTTransformer and VarianceAdaptor are assumed to be implemented separately
class FastSpeech2(nn.Module):
def __init__(self):
super().__init__()
self.encoder = FFTTransformer(n_layers=4, d_model=256, n_heads=2)
self.variance_adaptor = VarianceAdaptor(d_model=256)
self.decoder = FFTTransformer(n_layers=4, d_model=256, n_heads=2)
self.mel_linear = nn.Linear(256, 80)
def forward(self, phoneme_ids, duration_target=None, pitch_target=None, energy_target=None):
x = self.encoder(phoneme_ids)
x, duration, pitch, energy = self.variance_adaptor(
x, duration_target, pitch_target, energy_target
)
x = self.decoder(x)
mel = self.mel_linear(x)
return mel, duration, pitch, energy
Step 4: HiFi-GAN Vocoder Training
# Train HiFi-GAN on mel → waveform
# Generator loss
mel_loss = F.l1_loss(mel_fake, mel_real)
gan_loss = generator_adversarial_loss(disc_fake_outputs)
feature_match = feature_matching_loss(disc_real_features, disc_fake_features)
loss_G = mel_loss * 45 + gan_loss + feature_match * 2
# Discriminator loss
loss_D = discriminator_loss(disc_real_outputs, disc_fake_outputs)
Step 5: End-to-End Inference Pipeline
class TTSPipeline:
def __init__(self, tts_model, vocoder):
self.tts = tts_model
self.vocoder = vocoder
def synthesize(self, text, speed=1.0):
# 1. Text β phonemes
phonemes = text_to_phonemes(text)
phoneme_ids = text_to_ids(phonemes)
# 2. TTS model β mel spectrogram
with torch.no_grad():
mel, *_ = self.tts(
torch.LongTensor(phoneme_ids).unsqueeze(0),
d_control=speed
)
# 3. Vocoder β waveform
with torch.no_grad():
audio = self.vocoder(mel)
return audio.squeeze().cpu().numpy()
7. REVERSE ENGINEERING EXISTING SYSTEMS
7.1 Why Reverse Engineering?
- Learn from production-grade code
- Understand design decisions
- Identify optimizations for your use case
- Build intuition faster than pure theory
7.2 How to Reverse Engineer Whisper
Step 1: Read the Paper
- "Robust Speech Recognition via Large-Scale Weak Supervision" (Radford et al. 2022)
- Note: architecture section, training details, data section
Step 2: Clone and Explore Code
git clone https://github.com/openai/whisper
# Key files:
# whisper/model.py – Architecture
# whisper/audio.py – Feature extraction
# whisper/decoding.py – Beam search decoder
# whisper/tokenizer.py – BPE tokenizer
Step 3: Trace Forward Pass
import torch
import whisper
model = whisper.load_model("tiny")
# Trace: audio -> features (pad/trim to 30 s so the mel has 3000 frames)
audio = whisper.load_audio("speech.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio)  # (80, 3000)
# Encoder
encoded = model.encoder(mel.unsqueeze(0).to(model.device))  # (1, 1500, 384) for tiny
# Decoder (autoregressive greedy loop)
tokenizer = whisper.tokenizer.get_tokenizer(model.is_multilingual)
tokens = [tokenizer.sot]  # start-of-transcript token
for _ in range(100):
    logits = model.decoder(torch.tensor([tokens]).to(model.device), encoded)
    next_token = logits[0, -1].argmax().item()
    if next_token == tokenizer.eot:
        break
    tokens.append(next_token)
Step 4: Profile Bottlenecks
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
result = model.transcribe("audio.wav")
print(prof.key_averages().table(sort_by="cuda_time_total"))
# Identify: encoder dominates (80%), decoder is 20%
# Optimization: cache encoder, optimize decoder attention
Step 5: Rebuild from Scratch (Your Understanding)
# After studying, rebuild each component:
class MultiHeadAttention(nn.Module):
# Implement from scratch based on understanding
...
class ResidualAttentionBlock(nn.Module):
# Implement encoder block
...
# Compare outputs to original model
torch.testing.assert_close(your_output, original_output, atol=1e-4, rtol=1e-4)
7.3 How to Reverse Engineer VITS
git clone https://github.com/jaywalnut310/vits
# Key files:
# models.py – SynthesizerTrn (full model)
# attentions.py – Transformer blocks
# modules.py – WN (WaveNet-style), ResidualCouplingBlock (flows)
# monotonic_align/ – MAS (Monotonic Alignment Search)
# mel_processing.py – Mel spectrogram computation
Key insight from VITS code:
- SynthesizerTrn.infer() is inference path (no VAE encoder needed)
- SynthesizerTrn.forward() is training path (requires mel as target)
- monotonic_align.maximum_path() is the alignment algorithm (Cython)
7.4 Reverse Engineering Approach Template
- READ paper abstract + architecture section → mental model
- CLONE repository β understand file structure
- TRACE data flow (print shapes at each step; see the hook sketch below)
- IDENTIFY key components → isolate each into test
- REPRODUCE in clean code from memory
- VERIFY outputs match original
- EXPERIMENT: change hyperparameters, observe effects
- OPTIMIZE: profile, identify bottlenecks, improve
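One way to do the TRACE step in PyTorch: register forward hooks that print each leaf module's output shape while running a single example. The model and input below are stand-ins; swap in whatever you are reverse engineering.
import torch
import torch.nn as nn

def trace_shapes(model):
    """Attach a hook to every leaf module that prints its output shape."""
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            print(f"{module.__class__.__name__:20s} -> {tuple(output.shape)}")
    handles = []
    for module in model.modules():
        if len(list(module.children())) == 0:   # leaf modules only
            handles.append(module.register_forward_hook(hook))
    return handles                              # call h.remove() on each handle when done

# Stand-in model and input
model = nn.Sequential(nn.Conv1d(80, 256, 3, padding=1), nn.ReLU(), nn.Conv1d(256, 256, 3, padding=1))
handles = trace_shapes(model)
_ = model(torch.randn(1, 80, 300))
for h in handles:
    h.remove()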
8. HARDWARE REQUIREMENTS
8.1 Development Hardware
Minimum (Learning & Experimentation)
CPU: Intel Core i7 / AMD Ryzen 7 (8+ cores)
RAM: 16GB (32GB preferred)
GPU: NVIDIA RTX 3060 (12GB VRAM) or RTX 3070
Storage: 500GB SSD (NVMe preferred)
Note: Can fine-tune small models, run inference on all models
Recommended (Training Medium Models)
CPU: Intel Core i9 / AMD Ryzen 9 / Threadripper
RAM: 64GB DDR4/DDR5
GPU: NVIDIA RTX 3090 (24GB) or RTX 4090 (24GB) β single GPU
Storage: 2TB NVMe SSD + 4TB HDD for datasets
Cost: ~$3,000–$5,000
Note: Train FastSpeech2, HiFi-GAN, Conformer from scratch on LJ Speech
Research-Grade (Training Large Models)
GPU: 4× RTX 4090 or 4× A100 (40GB or 80GB)
RAM: 256GB
Storage: 10TB+ NVMe
Network: 100GbE for distributed training
Cost: $15,000–$40,000
Note: Train VITS, Conformer on LibriSpeech 960h
Cloud (Production Training)
AWS:
p3.2xlarge – 1× V100 16GB ($3.06/hr)
p3.8xlarge – 4× V100 64GB ($12.24/hr)
p4d.24xlarge – 8× A100 40GB ($32.77/hr)
Google Cloud:
a2-highgpu-1g – 1× A100 40GB ($3.67/hr)
a2-highgpu-8g – 8× A100 40GB ($29.39/hr)
Lambda Labs (cheapest GPU cloud):
1× A100 80GB ~$1.29/hr
8× A100 80GB ~$10.32/hr
Use Spot/Preemptible instances for ~60–70% discount
8.2 VRAM Requirements by Model
| Model | Task | VRAM (Training) | VRAM (Inference) |
|---|---|---|---|
| Conformer-S (10M) | ASR | 8GB | <1GB |
| Conformer-M (30M) | ASR | 12GB | 2GB |
| Whisper Large v3 | ASR | – (pretrained) | 6GB |
| FastSpeech 2 | TTS | 8GB | <1GB |
| VITS | TTS | 12GB | 2GB |
| HiFi-GAN | Vocoder | 8GB | <1GB |
| VALL-E style | TTS | 40GB+ | 8GB+ |
| Whisper large fine-tune | ASR | 24GB | 6GB |
8.3 Production Inference Hardware
CPU-Only (Lightweight)
Use case: Low-traffic, edge, embedded
Hardware: Modern x86 CPU with AVX2
Tools: ONNX Runtime, OpenVINO, CTranslate2 CPU
Models: Whisper tiny/base, Kokoro TTS
Latency: 1–5x real-time (RTF > 1)
GPU Server (Production)
Use case: High-traffic API service
Hardware: NVIDIA T4 ($0.35/hr on AWS), A10G, RTX 4090
Tools: Triton Server, TensorRT, CTranslate2 GPU
Models: Whisper large, VITS, XTTS
Latency: 0.1–0.3x real-time (RTF 0.1–0.3)
Edge Devices
NVIDIA Jetson Orin: On-device AI, 16-64GB unified memory
Apple Silicon M2/M3: Metal GPU, excellent for CoreML models
Raspberry Pi 5: Light STT only (Vosk, Whisper tiny)
Android/iOS: TFLite, ONNX Mobile models
9. BUILDING YOUR OWN SERVICE
9.1 System Architecture Overview
┌──────────────────────────────────────┐
│             CLIENT LAYER             │
│    Web App | Mobile | API Consumer   │
└──────────────────┬───────────────────┘
                   │ HTTPS / WebSocket
┌──────────────────┴───────────────────┐
│             API GATEWAY              │
│ Rate Limiting | Auth | Load Balance  │
└───────┬───────────────────┬──────────┘
        │                   │
┌───────┴────────────┐ ┌────┴─────────────────────┐
│    STT Service     │ │       TTS Service        │
│ FastAPI + Whisper  │ │   FastAPI + VITS/XTTS    │
└───────┬────────────┘ └────┬─────────────────────┘
        │                   │
┌───────┴───────────────────┴──────────────────────┐
│                 INFERENCE LAYER                   │
│     GPU Workers (Triton / CTranslate2 / ONNX)     │
└───────┬───────────────────┬───────────────────────┘
        │                   │
┌───────┴─────────┐ ┌───────┴────────────────┐
│  Message Queue  │ │     Model Registry     │
│  (Redis/Kafka)  │ │     (MLflow / S3)      │
└─────────────────┘ └────────────────────────┘
┌───────────────────────────────────────────────────┐
│                  STORAGE LAYER                    │
│     Audio Storage (S3/GCS) | DB (PostgreSQL)      │
└───────────────────────────────────────────────────┘
9.2 STT Service Implementation
FastAPI STT Service
from fastapi import FastAPI, UploadFile, File, WebSocket
from fastapi.responses import JSONResponse
import io
import numpy as np
from faster_whisper import WhisperModel
app = FastAPI(title="STT Service")
# Load model at startup
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
@app.post("/transcribe")
async def transcribe(
file: UploadFile = File(...),
language: str = "en",
task: str = "transcribe" # or "translate"
):
# Read uploaded audio
audio_bytes = await file.read()
audio_buffer = io.BytesIO(audio_bytes)
# Transcribe
segments, info = model.transcribe(
audio_buffer,
language=language,
task=task,
beam_size=5,
word_timestamps=True
)
result = {
"language": info.language,
"language_probability": info.language_probability,
"duration": info.duration,
"segments": [
{
"start": s.start,
"end": s.end,
"text": s.text,
"words": [{"word": w.word, "start": w.start, "end": w.end}
for w in (s.words or [])]
}
for s in segments
]
}
return JSONResponse(result)
@app.websocket("/stream")
async def stream_transcribe(websocket: WebSocket):
await websocket.accept()
# Streaming implementation
buffer = b""
async for data in websocket.iter_bytes():
buffer += data
if len(buffer) >= 32000 * 2: # 2 seconds of 16kHz int16
# Process chunk
audio = np.frombuffer(buffer, dtype=np.int16).astype(np.float32) / 32768.0
segments, _ = model.transcribe(audio, language="en")
for seg in segments:
await websocket.send_json({"text": seg.text, "final": False})
buffer = b""
Dockerized STT Service
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3-pip ffmpeg
WORKDIR /app
RUN pip3 install --no-cache-dir faster-whisper fastapi uvicorn python-multipart
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
9.3 TTS Service Implementation
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from TTS.api import TTS
import io
import soundfile as sf
import numpy as np
app = FastAPI(title="TTS Service")
# Load model
tts = TTS("tts_models/en/ljspeech/vits", gpu=True)
@app.post("/synthesize")
async def synthesize(
text: str,
speaker_id: int = 0,
speed: float = 1.0,
format: str = "wav"
):
# Generate audio
    # Single-speaker model: no speaker argument needed (multi-speaker models take speaker=<name>)
    wav = tts.tts(text=text, speed=speed)
# Convert to bytes
buffer = io.BytesIO()
sf.write(buffer, np.array(wav), 22050, format=format.upper())
buffer.seek(0)
return StreamingResponse(
buffer,
media_type=f"audio/{format}",
headers={"Content-Disposition": f"attachment; filename=speech.{format}"}
)
@app.post("/synthesize/stream")
async def synthesize_stream(text: str):
"""Stream audio chunks as they're generated"""
    async def generate():
        # split_into_sentences and wav_to_bytes are assumed helper functions
        for sentence in split_into_sentences(text):
            wav = tts.tts(text=sentence)
            yield wav_to_bytes(wav)
return StreamingResponse(generate(), media_type="audio/wav")
9.4 Voice Cloning Service
# Using XTTS for zero-shot voice cloning
# (reuses app, io, sf, StreamingResponse, UploadFile and File from the services above)
import uuid
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
config = XttsConfig()
config.load_json("XTTS-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="XTTS-v2/", eval=True)
model.cuda()
@app.post("/clone")
async def clone_voice(
text: str,
reference_audio: UploadFile = File(...),
language: str = "en"
):
# Save reference audio temporarily
ref_bytes = await reference_audio.read()
ref_path = f"/tmp/{uuid.uuid4()}.wav"
with open(ref_path, "wb") as f:
f.write(ref_bytes)
# Compute speaker latents
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
audio_path=[ref_path]
)
# Synthesize
out = model.inference(
text=text,
language=language,
gpt_cond_latent=gpt_cond_latent,
speaker_embedding=speaker_embedding,
temperature=0.7
)
buffer = io.BytesIO()
sf.write(buffer, out["wav"], 24000, format="WAV")
buffer.seek(0)
return StreamingResponse(buffer, media_type="audio/wav")
9.5 Deployment with Docker Compose
version: '3.8'
services:
stt-service:
build: ./stt
ports: ["8001:8000"]
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
volumes:
- ./models:/models
environment:
- MODEL_PATH=/models/whisper-large-v3
tts-service:
build: ./tts
ports: ["8002:8000"]
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
volumes:
- ./models:/models
nginx:
image: nginx:alpine
ports: ["80:80", "443:443"]
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf
- ./certs:/etc/ssl/certs
depends_on: [stt-service, tts-service]
redis:
image: redis:7-alpine
ports: ["6379:6379"]
prometheus:
image: prom/prometheus
ports: ["9090:9090"]
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
10. BUILD IDEAS: BEGINNER β ADVANCED
BEGINNER LEVEL (0–3 months)
Project 1: Voice Recorder + Transcriber
BEGINNER
Stack: Python + OpenAI Whisper + sounddevice
- Record from microphone with sounddevice
- Save as WAV file
- Pass to Whisper for transcription
- Save transcript as text file
Learning: Audio I/O, Whisper API, file handling
Project 2: Text-to-Speech Converter
BEGINNER
Stack: Python + Coqui TTS or pyttsx3
- Accept text input via CLI
- Generate speech audio
- Play back or save to file
- Support multiple voices
Learning: TTS API, audio output, voice selection
Project 3: Meeting Transcriber
BEGINNER
Stack: Python + Whisper + pyaudio + tkinter
- Simple GUI with start/stop recording
- Real-time transcription display
- Export to .txt or .docx
- Speaker diarization (basic)
Learning: GUI, audio streaming, file export
Project 4: Language Learning Pronunciation Checker
BEGINNER
Stack: Python + Whisper + phonemizer
- User reads a sentence aloud
- Compare recognized phonemes to expected phonemes
- Score pronunciation accuracy
- Highlight mispronounced words
Learning: Phoneme comparison, scoring, feedback
INTERMEDIATE LEVEL (3–8 months)
Project 5: Real-Time Transcription Web App
INTERMEDIATE
Stack: FastAPI + WebSocket + Whisper + React
- Browser captures microphone stream (MediaRecorder API)
- Streams audio chunks to WebSocket server
- Server transcribes chunks with streaming Whisper
- Live captions displayed in browser
- Export transcript feature
Learning: WebSockets, browser audio API, streaming inference
Project 6: Podcast TTS Generator
INTERMEDIATE
Stack: FastSpeech2 + HiFi-GAN + FastAPI + React
- Input: article URL or text
- Extract text from URL (newspaper3k)
- Normalize text (numbers, abbreviations)
- Generate speech with FastSpeech2 + HiFi-GAN
- Return downloadable MP3
- Add playback controls in UI
Learning: TTS pipeline, text normalization, web scraping, audio encoding
Project 7: Voice Command System
INTERMEDIATE
Stack: Whisper + intent classification + TTS response
- Wake word detection (Picovoice Porcupine or custom)
- STT for command capture
- Intent extraction (fine-tuned BERT or regex)
- Execute command (volume, calendar, search, etc.)
- TTS response
Learning: Wake word detection, intent classification, action execution
Project 8: Multi-Speaker Diarization + Transcription
INTERMEDIATE
Stack: Whisper + pyannote.audio + spaCy
- Transcribe audio with word timestamps
- Run speaker diarization (pyannote)
- Assign speakers to transcribed words
- Output: "Speaker 1: Hello... Speaker 2: Hi..."
- Format as readable transcript
Learning: Diarization, timestamp alignment, NLP post-processing
Project 9: Fine-Tuned Domain ASR
INTERMEDIATE
Stack: Whisper + HuggingFace + custom medical/legal corpus
- Collect domain-specific audio+transcripts
- Fine-tune Whisper small or medium
- Evaluate domain-specific WER improvement
- Deploy via FastAPI
Learning: Transfer learning, dataset preparation, evaluation, deployment
Project 10: Custom Voice Cloner
INTERMEDIATE
Stack: XTTS-v2 or YourTTS + FastAPI
- API endpoint: /clone with {text, reference_audio}
- Accept 5β15s reference audio
- Generate speech in cloned voice
- Multiple language support
Learning: Voice cloning, speaker embeddings, API design
ADVANCED LEVEL (8–18 months)
Project 11: Production STT Service (Commercial Grade)
ADVANCED
Features:
- Multi-language detection and transcription
- Real-time streaming (WebSocket) + batch API (REST)
- Speaker diarization
- Custom vocabulary / hotwords boosting
- Punctuation and capitalization restoration
- Confidence scores per word
- Webhook callbacks for async jobs
- Dashboard: usage, latency, error rates
Stack: Whisper large + NeMo + pyannote + FastAPI + Redis + PostgreSQL +
Prometheus + Grafana + Kubernetes + Nginx
Scaling: Horizontal pod autoscaling based on GPU queue depth
Project 12: Production TTS Service (API like ElevenLabs)
ADVANCED
Features:
- 20+ pre-built voices with distinct personalities
- Zero-shot voice cloning from <30s reference
- Emotion/style control (happy, sad, excited, whisper)
- SSML support (rate, pitch, emphasis, break)
- Streaming audio generation
- 20+ language support
- REST API + Python/JS SDK
- Usage billing integration
Stack: VITS + XTTS + HiFi-GAN + FastAPI + Stripe + Redis + S3 + CDN
Project 13: Voice Conversion System
ADVANCED
Features:
- Convert speaker identity while preserving content
- Any-to-any voice conversion
- Real-time capability (<200ms latency)
Architecture:
Input audio → ASR (content) + Speaker encoder (style)
→ Voice decoder → Target voice audio
Models: FreeVC, DDSP-VC, Diff-VC, QuickVC
Learning: Disentanglement, speaker representation, real-time processing
Project 14: End-to-End Speech Translation
ADVANCED
Architecture:
Source language audio → SeamlessM4T / NLLB-Audio
→ Target language text (or audio)
Features:
- Direct speech-to-speech translation (no text intermediate)
- 100+ language pairs
- Real-time streaming
- Preserve prosody/emotion in output
Stack: SeamlessM4T (Meta) + FastAPI + WebSocket
Project 15: Train Your Own TTS from Scratch
ADVANCED
Steps:
1. Record 20+ hours of custom voice in studio
2. Segment and transcribe all audio (Whisper-assisted)
3. Train FastSpeech2 acoustic model from scratch
4. Train HiFi-GAN vocoder from scratch
5. Fine-tune VITS end-to-end
6. Implement MOS (Mean Opinion Score) evaluation
7. A/B test against Coqui/ElevenLabs
Learning: Full training pipeline, data curation, model evaluation, production deployment
RESEARCH / EXPERT LEVEL (18+ months)
Project 16: Codec Language Model TTS (VALL-E style)
RESEARCH
Architecture:
Text → Phonemes → Token sequence → AR Transformer → Coarse codec tokens
→ NAR Transformer → Fine codec tokens → EnCodec decoder → Audio
Training:
- Pretrain on 10,000+ hours of diverse speech
- EnCodec tokenizer (8 codebooks, 75Hz)
- GPT-style LM for coarse tokens
- BERT-style masked model for fine tokens
Innovation opportunities:
- Better alignment between text and audio tokens
- Emotion conditioning
- Efficiency improvements
Project 17: Streaming On-Device STT (Mobile)
RESEARCH
Target: <50ms latency, <100MB model, runs on phone CPU
Approach:
- Start with Conformer-Tiny + CTC
- Quantize to INT8
- Optimize with TFLite delegate or CoreML
- Implement streaming chunk processing
- Add on-device LM rescoring (tiny n-gram)
Platforms: Android (TFLite) + iOS (CoreML)
Learning: Mobile ML optimization, quantization, edge deployment
Project 18: Multilingual Universal Speech Model
RESEARCH
Scope: Single model for 50+ languages, STT + TTS
STT:
- Pretrain wav2vec 2.0 on 50-language corpus
- Fine-tune with multilingual CTC
- Adapter modules per language
TTS:
- Shared phoneme inventory across languages
- Language embedding conditioning
- Cross-lingual transfer for low-resource languages
Evaluation: FLEURS benchmark across all languages
11. CUTTING-EDGE DEVELOPMENTS (2023–2025)
11.1 Speech Recognition
- Whisper Large v3 Turbo (2024): 809M params, decoder pruned from 32 to 4 layers, several times faster than large-v3 with similar accuracy
- Distil-Whisper (2023, Hugging Face): 6x speedup, 49% fewer params, <1% WER degradation
- Universal-1 (AssemblyAI, 2024): SOTA commercial STT, best on noisy data
- Gemini Audio (Google, 2024): Natively multimodal, audio reasoning
- Canary-1B (NVIDIA, 2024): Conformer + attention, multilingual, speech translation
- MMS (Meta, 2023): 1000+ language STT using one model
- OWSM (2024): Open-source replica of Whisper at larger scale (25k hours+)
- parakeet-tdt (NVIDIA, 2024): Token-and-duration transducer, near real-time
11.2 Text-to-Speech
- VALL-E 2 (Microsoft, 2024): reports human parity on LibriSpeech/VCTK benchmarks, using repetition-aware sampling and grouped code modeling
- NaturalSpeech 3 (Microsoft, 2024): FACodec disentanglement + diffusion
- CosyVoice (Alibaba, 2024): LLM-based TTS with flow matching
- F5-TTS (2024): Flow matching TTS, DiT architecture, flat text input, SOTA
- E2-TTS (2024): Simple flow-matching TTS, impressive quality
- FireRedTTS (2024): High-quality Chinese TTS
- Kokoro (2024): Small (82M), fast, open-weights, near-SOTA quality
- StyleTTS 2 (2023): Diffusion + style modeling, SOTA on LJ Speech
- Parler-TTS (2024): Natural language description controls TTS voice
- Amphion (2024): Unified open-source TTS/VC/SVC framework
- HierSpeech++ (2024): Hierarchical variational inference, high quality
11.3 Voice & Audio Foundation Models
- EnCodec (Meta, 2022): Neural audio codec, 24kHz, residual VQ
- DAC (Descript, 2023): Improved neural codec, better perceptual quality
- AudioPaLM (Google, 2023): Multimodal LLM combining speech + text
- SpeechX (Microsoft, 2023): Unified speech model for many tasks
- UniAudio (2023): One model for 11 audio tasks
- VoxtLM (2024): Language model for joint speech-text
- Spirit LM (Meta, 2024): Interleaved speech-text LLM with expressive speech
11.4 Voice Cloning & Conversion
- OpenVoice v2 (2024): Near-zero-shot cloning, tone/style/accent control
- XTTS v2 (Coqui, 2023): 17-language voice cloning, 6s reference
- RVC v2: Real-Time Voice Conversion, widely used for singing conversion
- So-VITS-SVC: Singing voice conversion based on VITS
- Seed-TTS (ByteDance, 2024): Near-perfect voice cloning, emotional control
11.5 Real-Time & Streaming
- Moshi (Kyutai, 2024): Real-time full-duplex speech dialogue system
- RealtimeTTS: Python library for ultra-low-latency streaming TTS
- moonshine (Useful Sensors, 2024): On-device STT, faster than Whisper tiny
- whisper.cpp: C++ Whisper, runs on CPU, iOS, Android, Raspberry Pi
11.6 Key Research Directions (2025+)
- Speech LLMs: End-to-end spoken dialogue models (like GPT-4o audio)
- Zero-shot multilingual TTS: One model, any language, any voice
- Codec-based unified models: Everything tokenized as audio codes
- On-device streaming: Sub-100ms full-stack STT+TTS on mobile
- Emotional speech: Expressive control beyond speed/pitch
- Personalization: Continuous adaptation from user speech
- Anti-spoofing: Detecting deepfake audio (ADD challenge)
12. RESOURCES, DATASETS & REFERENCES
12.1 Key Research Papers (Read in Order)
STT Papers
- "A tutorial on hidden Markov models" (Rabiner, 1989)
- "Deep Speech: Scaling up end-to-end speech recognition" (Baidu, 2014)
- "Connectionist Temporal Classification" (Graves et al., 2006)
- "Attention-Based Models for Speech Recognition" (Chorowski, 2015)
- "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech" (Meta, 2020)
- "HuBERT: Self-Supervised Speech Representation Learning" (Meta, 2021)
- "Conformer: Convolution-augmented Transformer for SR" (Google, 2020)
- "Robust Speech Recognition via Large-Scale Weak Supervision" (Whisper, OpenAI, 2022)
- "Distil-Whisper: Robust Knowledge Distillation" (Hugging Face, 2023)
TTS Papers
- "WaveNet: A Generative Model for Raw Audio" (DeepMind, 2016)
- "Tacotron: Towards End-to-End Speech Synthesis" (Google, 2017)
- "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Tacotron 2, 2018)
- "FastSpeech: Fast, Robust and Controllable TTS" (Microsoft, 2019)
- "FastSpeech 2: Fast and High-Quality E2E TTS" (Microsoft, 2020)
- "HiFi-GAN: Generative Adversarial Networks for Audio Synthesis" (2020)
- "VITS: Conditional Variational Autoencoder with Adversarial Learning for E2E TTS" (2021)
- "VALL-E: Neural Codec Language Models are Zero-Shot TTS" (Microsoft, 2023)
- "Voicebox: Text-Guided Multilingual Universal Speech Generation" (Meta, 2023)
- "NaturalSpeech 3: Zero-Shot Copier-Free Voice Cloning" (Microsoft, 2024)
- "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech" (2024)
12.2 Datasets
ASR Datasets
- LibriSpeech – 960h English, clean+noisy (openslr.org/12)
- CommonVoice 17 – Mozilla, 100+ languages (commonvoice.mozilla.org)
- VoxPopuli – 1791h EU parliament (github.com/facebookresearch/voxpopuli)
- GigaSpeech – 10,000h English diverse (github.com/SpeechColab/GigaSpeech)
- AISHELL-1/2 – Mandarin Chinese (openslr.org/33)
- MCV Corpus – Hindi, Marathi, Tamil, Telugu + many Indian languages
- MUSAN – Noise, music, speech for augmentation
- RIR_NOISES – Room impulse responses
- FLEURS – Google, 100 languages, 12h each
TTS Datasets
- LJ Speech – 24h single speaker, high quality (keithito.com/LJ-Speech-Dataset)
- VCTK – 109 speakers, English (datashare.ed.ac.uk)
- LibriTTS – 585h multi-speaker, clean (openslr.org/60)
- HiFi-TTS – 291h high quality multi-speaker
- AISHELL-3 – 85h Mandarin multi-speaker
- CSS10 – 10 languages, single speaker each
- Kokoro dataset – High-quality curated English
- ESD – Emotional speech dataset (5 emotions, 10 speakers)
12.3 Pre-trained Models to Start With
STT
- openai/whisper-large-v3 (HuggingFace)
- nvidia/parakeet-tdt-1.1b (HuggingFace)
- facebook/wav2vec2-large-960h-lv60 (HuggingFace)
- speechbrain/asr-conformer-... (SpeechBrain Hub)
TTS
- tts_models/en/ljspeech/vits (Coqui TTS)
- tts_models/multilingual/multi-dataset/xtts_v2 (Coqui)
- hexgrad/Kokoro-82M (HuggingFace)
- facebook/mms-tts-eng (HuggingFace)
- suno/bark (HuggingFace)
Speaker
- speechbrain/spkrec-ecapa-voxceleb (Speaker verification)
- pyannote/speaker-diarization-3.1 (Diarization)
12.4 Courses & Learning Resources
COURSES
- Stanford CS224S: Spoken Language Processing (free online)
- Fast.ai Practical Deep Learning (free)
- DeepLearning.AI Sequence Models (Coursera)
- CMU 11-751 Speech Recognition (lecture slides free)
- Hugging Face Audio Course (huggingface.co/learn/audio-course – free)
BOOKS
- "Speech and Language Processing" β Jurafsky & Martin (free PDF: web.stanford.edu/~jurafsky/slp3)
- "Fundamentals of Speech Recognition" β Rabiner & Juang
- "Deep Learning" β Goodfellow, Bengio, Courville (free: deeplearningbook.org)
- "Neural Network Methods for NLP" β Goldberg
KEY BLOGS & RESOURCES
- Lilian Weng's Blog (lilianweng.github.io) – Excellent deep dives
- Papers With Code (paperswithcode.com/task/speech-synthesis)
- Hugging Face Blog
- NVIDIA Developer Blog
- Distill.pub – Visual explanations
COMMUNITIES
- r/MachineLearning (Reddit)
- Hugging Face Discord
- ESPnet GitHub Discussions
- SpeechBrain Slack
- ML Discord servers
12.5 Benchmarks & Evaluation Tools
STT Benchmarks
- LibriSpeech test-clean: Target WER < 2.5% (SOTA ~1.4%)
- LibriSpeech test-other: Target WER < 5% (SOTA ~2.7%)
- CommonVoice: Multilingual WER
- Earnings21: Real-world earnings call transcription
- CHiME-6: Noisy far-field challenge
- NOIZEUS: Noise robustness
TTS Evaluation
- MOS (Mean Opinion Score): Human evaluation 1-5 scale
- UTMOS: Automatic MOS predictor (neural)
- DNSMOS P.835: Noise/speech quality
- SpeechBERTScore: Semantic similarity
- PESQ: Perceptual speech quality
- STOI: Short-Time Objective Intelligibility
- F0 RMSE: Pitch prediction accuracy
- MCD (Mel Cepstral Distortion): Acoustic similarity
Tools for Evaluation
# WER calculation
pip install jiwer
from jiwer import wer, cer
# MOS prediction
pip install speechmos
from speechmos import dnsmos
# PESQ and STOI
pip install pesq pystoi
# Forced alignment (for TTS duration evaluation)
pip install montreal-forced-aligner
12.6 QUICK START CHECKLIST
Week 1: Environment Setup
- Install Python 3.10+, CUDA, PyTorch with GPU support
- Install librosa, torchaudio, transformers, TTS (Coqui)
- Download and run Whisper on a test file
- Run Coqui TTS on a test sentence
- Plot a mel spectrogram from scratch
Week 2–4: Foundations
- Implement MFCC from scratch (no librosa)
- Implement simple HMM for digit recognition
- Fine-tune Whisper base on a custom 1-hour dataset
- Run VITS inference on LJ Speech
Month 2–3: First Models
- Train FastSpeech 2 on LJ Speech (2–3 days on RTX 3090)
- Train HiFi-GAN vocoder
- Build a REST API for STT and TTS
- Build a simple web UI for your service
Month 4–6: Production Service
- Containerize with Docker
- Add authentication, rate limiting
- Add monitoring (Prometheus + Grafana)
- Deploy to cloud (AWS/GCP/Lambda Labs)
- Achieve <300ms TTS latency
Month 7–12: Specialization
- Choose: voice cloning, multilingual, on-device, or real-time streaming
- Train a model from scratch on custom data
- Publish an open-source project or demo
- Read 5+ papers from the cutting-edge list